On Biases in Estimating Multi-Valued Attributes

Author

  • Igor Kononenko
Abstract

We analyse the biases of eleven measures for estimating the quality of multi-valued attributes. The values of information gain, J-measure, gini-index and relevance tend to increase linearly with the number of values of an attribute. The values of gain-ratio, distance measure, Relief and the weight of evidence decrease for informative attributes and increase for irrelevant attributes. The bias of the statistical tests based on the chi-square distribution is similar, but these functions are not able to discriminate among attributes of different quality. We also introduce a new function based on the MDL principle whose value slightly decreases with the increasing number of an attribute's values.

1 Introduction

In top-down induction of decision trees, various impurity functions are used to estimate the quality of attributes in order to select the "best" one to split on. However, various heuristics tend to overestimate multi-valued attributes. One possible approach to this problem in top-down induction of decision trees is the construction of binary decision trees. The other approach is to introduce a kind of normalization into the selection criterion, such as gain-ratio [Quinlan, 1986] and distance measure [Mantaras, 1989]. Recently, White and Liu [1994] showed that, even with normalization, information-based heuristics still tend to overestimate the attributes with more values. Their experiments indicated that the $\chi^2$ and G statistics are superior estimation techniques to information gain, gain-ratio and distance measure. They used the Monte Carlo simulation technique to generate artificial data sets with attributes with various numbers of values. However, their scenario included only random attributes with a uniform distribution over the attributes' values, generated independently of the class.

The purpose of our investigation is to verify the conclusions of White and Liu in more realistic situations where attributes are informative and/or have a nonuniform distribution of attribute values. We adopted and extended their scenario in order to verify the results of the methods tested by White and Liu and to test some other well-known measures as well: gini-index [Breiman et al., 1984], J-measure [Smyth and Goodman, 1990], the weight of evidence [Michie, 1989], and relevance [Baim, 1988]. In addition, we developed and tested one new selection measure based on the minimum description length (MDL) principle and a measure derived from the algorithm RELIEF [Kira and Rendell, 1992]. In the following we describe all selection measures, the experimental scenario and the results. We analyse the (dis)advantages of the various selection measures.

2 Selection measures

In this section we briefly describe all selection measures and develop a new one based on the MDL principle. We assume that all attributes are discrete and that the problem is to select the best attribute among attributes with various numbers of possible values. All selection measures are defined in such a way that the best attribute should maximize the measure. Let C, A and V be the number of classes, the number of attributes and the number of values of the given attribute, respectively. Let $n$ denote the number of training instances, $n_{i.}$ the number of training instances from class $C_i$, $n_{.j}$ the number of instances with the $j$-th value of the given attribute, and $n_{ij}$ the number of instances from class $C_i$ and with the $j$-th value of the given attribute. Let further $p_{ij}$, $p_{i.}$, $p_{.j}$ and $p_{i|j}$ denote the approximation of the corresponding probabilities from the training set.

2.1 Information based measures

Let $H_C$, $H_A$, $H_{CA}$ and $H_{C|A}$ be the entropy of the classes, of the values of the given attribute, of the joint events class-attribute value, and of the classes given the value of the attribute, respectively.
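The four entropies can be computed directly from a contingency table of the counts $n_{ij}$. The following Python sketch is illustrative only and is not taken from the paper: the function names are hypothetical, probabilities are assumed to be estimated by relative frequencies, and the information gain and gain-ratio formulas are the standard ones associated with [Quinlan, 1986] rather than definitions quoted from the excerpt above.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropies(counts):
    """Return (H_C, H_A, H_CA, H_C|A) for counts[i][j] = n_ij.

    Rows are classes, columns are values of the given attribute.
    """
    n = sum(sum(row) for row in counts)
    p_class = [sum(row) / n for row in counts]              # p_i.
    p_value = [sum(col) / n for col in zip(*counts)]        # p_.j
    p_joint = [n_ij / n for row in counts for n_ij in row]  # p_ij
    h_c = entropy(p_class)
    h_a = entropy(p_value)
    h_ca = entropy(p_joint)
    h_c_given_a = h_ca - h_a  # chain rule: H_CA = H_A + H_C|A
    return h_c, h_a, h_ca, h_c_given_a


def information_gain(counts):
    """Assumed standard definition: Gain = H_C - H_C|A."""
    h_c, _, _, h_c_given_a = entropies(counts)
    return h_c - h_c_given_a


def gain_ratio(counts):
    """Assumed standard definition: GainRatio = Gain / H_A (0 if H_A = 0)."""
    _, h_a, _, _ = entropies(counts)
    return information_gain(counts) / h_a if h_a > 0 else 0.0
```

For example, `information_gain([[4, 0], [0, 4]])` describes a binary attribute that separates two equally frequent classes perfectly, while `information_gain([[2, 2], [2, 2]])` describes an attribute that is independent of the class.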

Similar resources

Solving robot selection problem by a new interval-valued hesitant fuzzy multi-attributes group decision method

‎Selecting the most suitable robot among their wide range of specifications and capabilities is an important issue to perform the hazardous and repetitive jobs‎. ‎Companies should take into consideration powerful group decision-making (GDM) methods to evaluate the candidates or potential robots versus the selected attributes (criteria)‎. ‎In this study‎, ‎a new GDM method is proposed by utilizi...

A New Extended Analytical Hierarchy Process Technique with Incomplete Interval-valued Information for Risk Assessment in IT Outsourcing

Information technology (IT) outsourcing has been recognized as a new methodology in many organizations. Yet making an appropriate decision with regard to the selection and use of these methodologies may impose uncertainties and risks. Estimating the occurrence probability of risks and their impacts on organizations' goals may reduce their threats. In this study, an extended analytical hierarchical proc...

Interval-Valued Hesitant Fuzzy Method based on Group Decision Analysis for Estimating Weights of Decision Makers

In this paper, a new soft computing group decision method based on the concept of compromise ratio is introduced for determining decision makers (DMs)' weights through the group decision process under uncertainty. In this method, preferences and judgments of the DMs or experts are expressed by linguistic terms for rating the industrial alternatives among selected criteria as well as the relativ...

Multi-valued fixed point theorems in complex valued $b$-metric spaces

‎The aim of this paper is to establish and prove some results on common fixed point‎ for a pair of multi-valued mappings in complex valued $b$-metric spaces‎. ‎Our‎ ‎results generalize and extend a few results in the literature‎.  

Strict fixed points of Ćirić-generalized weak quasicontractive multi-valued mappings of integral type

Many authors, such as Amini-Harandi, Rezapour et al., and Kadelburg et al., have tried to find at least one fixed point for quasi-contractions when $\alpha\in[\frac{1}{2}, 1)$, but no clear answer exists right now and many of them either have failed or changed to a lighter version. In this paper, we introduce some new strict fixed point results in the set of multi-valued Ćirić-gener...

A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...

Publication date: 1995